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CHANGE-POINT MODEL ON NONHOMOGENEOUS POISSON 
PROCESSES WITH APPLICATION IN COPY NUMBER 
PROFILING BY NEXT-GENERATION DNA SEQUENCING 1 

By Jeremy J. Shen and Nancy R. Zhang 

Stanford University 

We propose a flexible change-point model for inhomogeneous 
Poisson Processes, which arise naturally from next-generation DNA 
sequencing, and derive score and generalized likelihood statistics for 
shifts in intensity functions. We construct a modified Bayesian infor- 
mation criterion (mBIC) to guide model selection, and point-wise ap- 
proximate Bayesian confidence intervals for assessing the confidence 
in the segmentation. The model is applied to DNA Copy Number 
profiling with sequencing data and evaluated on simulated spike-in 
and real data sets. 

1. Introduction. For a biological sample, the DNA copy number of a ge- 
nomic region is the number of copies of the DNA in that region within the 
genome of the sample, relative to either a single control sample or a pool 
of population reference samples. DNA Copy Number Variants (CNVs) are 
genomic regions where copy number differs among individuals. Such varia- 
tion in copy number constitutes a common type of population-level genetic 
polymorphism. See Khaja et al. (2007), Redon et al. (2006), Conrad et al. 
(2006) and McCarroll et al. (2006) for detailed discussions on CNV in the 
human population. 

On another front, the genomes of tumor cells often undergo somatic struc- 
tural mutations such as deletions and duplications that affect copy number. 
This results in copy number differences between tumor cells and normal cells 
within the same individual. These changes are often termed Copy Number 
Aberrations or Copy Number Alternations (CNA). There is significant sci- 
entific interest in finding CNVs in normal individuals and CNAs in tumors, 
both of which entail locating the boundaries of the regions in the genome 
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that have undergone copy number change (i.e., the breakpoints), and esti- 
mating the copy numbers within these regions. In this article, we use next- 
generation sequencing data for copy number estimation. 

Microarrays have become a commonly used platform for high-throughput 
measurement of copy number. There are many computational methods that 
estimate copy number using the relative amount of DNA hybridization to 
an array. See Lai et al. (2005), Willenbrock and Fridlyand (2005) and Zhang 
(2010) for a general review of existing methods for array-based data. How- 
ever, the precision of breakpoint estimates with array-based technology is 
limited by its ability to measure genomic distances between probes, which 
currently averages about 1000 bases (1 Kb) on most arrays. Hence, the lower 
limit in the length of detectable CNV events is about 1 Kb. With sequencing 
capacity growing and its cost dropping dramatically, massively parallel se- 
quencing is now an appealing method for measuring DNA copy number. In 
these newer sequencing technologies, a large number of short reads (36-100 
bp) are sequenced in parallel from the fragmentation of sample DNA. Then 
each read is mapped to a reference genome. The basic rationale is that cov- 
erage, defined as the number of reads mapped to a region of the reference 
genome, reflects the copy number of that region in the sample, but with 
many systematic biases and much variability across the genome. Campbell 
et al. (2008) was one of the first to use genome-wide sequencing to detect 
CNA events. The reader is also referred to Medvedev, Stanciu and Brudno 
(2009) for a review of recent studies in CNV/CNA detection using sequenc- 
ing data. More details of the data, with an illustrative example (Figure 2), 
are given in Section 2. 

In the shift from array-based to sequencing-based copy number profiling, 
the main statistical challenge arises from the fundamental change in the type 
of data observed. Array-based data are represented by a large but fixed num- 
ber of continuous valued random variables that are approximately normal 
after appropriate preprocessing, and CNV/CNA signals based on array data 
can be modeled as shifts in mean. Sequencing-based data, as we will discuss 
further in Section 2, are realizations of point processes, where CNV/CNA 
signals are represented by shifts in intensity of the process. While one can 
apply a normal approximation to the large number of discrete events in se- 
quencing data, hence translating the problem into the familiar array-based 
setting, this approach is inefficient and imprecise. A more direct model of the 
point process is preferred. This type of data calls for a new statistical model, 
new test statistics, and, due to the quick growth of sequencing capacity, new 
and highly efficient computing implementation. 

In copy number profiling it is important to assess the confidence in the 
estimated copy numbers. With the exception of Lai, Xing and Zhang (2007), 
existing segmentation methods, both for array data and for sequencing data, 
give a hard segmentation and do not quantify the uncertainty in their 
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change-point estimates. Some methods, such as Olshen et al. (2004) and 
Wang et al. (2005), provide confidence assessments for the called CNV or 
CNA regions, in the form of false discovery rates or p-values, thus inher- 
ently casting the problem in a hypothesis testing framework. However, for 
the analysis of complex regions with nested changes, such as those in tumor 
data, confidence intervals on the copy number, from an estimation perspec- 
tive, are often more useful. Intuitively, the copy number estimate is less 
reliable for a region near a change point than for a region far away from 
any change points. Also, copy number estimates are more reliable for re- 
gions with high coverage than for regions with low coverage, since coverage 
directly affects the number of observations used for estimation. This latter 
point makes confidence intervals particularly important for interpretation 
of results derived from short read sequencing data, where coverage can be 
highly uneven across the genome. In this paper, we take a Bayesian approach 
with noninformative priors to compute point-wise confidence intervals, as 
described in Section 4. 

The proposed methods are based on a simple and flexible inhomogeneous 
Poisson Process model for sequenced reads. We derive the score and gener- 
alized likelihood ratio statistics for this model to detect regions where the 
read intensity shifts in the target sample, as compared to a reference. We 
construct a modified Bayes information criterion (mBIC) to select the appro- 
priate number of change points and propose Bayesian point-wise confidence 
intervals as a way to assess the confidence in the copy number estimates. 
As a proof of concept, we apply seqCBS, our sequencing-based CNV/CNA 
detection algorithm, to a number of actual data sets and found it to have 
good concordance with array-based results. We also conduct a spike- in study 
and compare the proposed method to SegSeq, a method proposed by Chiang 
et al. (2009). 

The methods developed in this paper have been implemented in an open- 
source R-package, SeqCBS, available from CRAN http : / / cran . r-pro j ect . 
org/web/packages/seqCBS/index . html. 

2. Data and existing methods. In a general next-generation genome se- 
quencing/resequencing pipeline, shown in Figure 1, the DNA in the sample 
is randomly fragmented, and a short sequence of the ends of the fragments is 
"read" by the sequencer. After the bases in the reads are called, the reads are 
mapped to the reference genome. There are many different approaches to the 
preparation of the DNA library prior to the sequencing step, some involving 
amplification by polymerase chain reaction, which lead to different distribu- 
tion of reads along the reference genome. When a region of the genome is 
duplicated, fragments from this region have a higher representation, and thus 
its clones are more likely to be read by the sequencer. Hence, when mapped 
to reference genome, this duplicated region has a higher read intensity. Sim- 
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ilarly, a deletion manifests as a decrease in read intensity. Since reads are 
contiguous fixed length sequences, it suffices to keep track of the reference 
mapping location of one of the bases within the read. Customarily, the ref- 
erence mapping location of the 5' end of the read is stored and reported. 
This yields a point process with the reference genome as the event space. 

As noted in previous studies, sequencing coverage is dependent on char- 
acteristics of the local DNA sequence, and fluctuates even when there are 
no changes in copy number, as shown in Dohm et al. (2008). Just as ad- 
justing for probe-effects is important for interpretation of microarray data, 
adjusting for these baseline fluctuations in depth of coverage is important 
for sequencing data. The bottom panels of Figure 3 show the varying depth 
of coverage for Chromosomes 8 and 11 in the sequencing of a normal hu- 
man sample, HCC1954. Many factors cause the inhomogeneity of depth of 
coverage. For example, regions of the genome that contain more G/C bases 
are typically more difficult to fragment in an experiment. This results in 
lower depth of coverage in such regions. Some regions of the genome are 
highly repetitive. It is challenging to map reads from repetitive regions cor- 
rectly onto the reference genome and, hence, some of the reads are inevitably 
discarded as unmappable, resulting in loss of coverage in that region, even 
though no actual deletion has occurred. Some ongoing efforts on the analy- 
sis of sequencing data involve modeling the effects of measurable quantities, 
such as GC content and mappability, on baseline depth. Cheung et al. (2011) 
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demonstrated that read counts in sequencing are highly dependent on GC 
content and mappability, and discussed a method to account for such sys- 
tematic biases. Benjamini and Speed (2011) investigated the relationship 
between GC content and read count on the Illumina sequencing platform 
with a single position model, and identified a family of unimodal curves that 
describes the effect of GC content on read count. We take the approach of 
empirically controlling for the baseline fluctuations by comparing the sample 
of interest to a control sample that was prepared and sequenced by the same 
protocol. In the context of tumor CNA detection, the control is preferably 
a matched normal sample, for it eliminates the discovery of germline copy 
number variants and allows one to focus on somatic CNA regions of the spe- 
cific tumor genome. If a perfectly matched sample is not possible, a carefully 
chosen control or a pool of controls, with sequencing performed on the same 
platform with the same experimental protocol, would work for our method 
as well since almost all of the normal human genome are identical. 

As a simple and illustrative example of the data, we generated points 
according to a nonhomogeneous Poisson Process. Figure 2 shows the point 
processes and the underlying pit) function, defined as the probability that 
a read at genomic position t is from the case/tumor sample, conditional on 
the existence of a read at position t. The model is discussed in more detail 
in Section 3. The y-values for the points are jittered for graphical clarity. 

Existing methods on CNV and CNA detection with sequencing data gen- 
erally follow the change-point paradigm, which is natural since copy number 
changes reflect actual breakpoints along chromosomes. Chiang et al. (2009) 
proposed the algorithm SegSeq that segments the genomes of a tumor and 
a matched normal sample by using a sliding fixed size window, reducing the 
data to the ratio of read counts for each window. Xie and Tammi (2009) 
proposed CNV-seq that detects CNV regions between two individuals based 
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on binning the read counts and then applying methods developed for ar- 
ray data. Yoon et al. (2009) designed a method named Event- Wise Testing 
(EWT) that detects CNV events using a fixed-window scan on the GC con- 
tent adjusted read counts. Ivakhno et al. (2010) proposed a method called 
CNAseg that uses read counts in windows of predefined size, and discov- 
ers CNV using a Hidden Markov Model segmentation. As for single sample 
CNV detection method, Boeva et al. (2011) constructed a computational 
algorithm that normalizes read counts by GC content and estimates the 
absolute copy number. 

These existing methods approach this statistical problem by binning or 
imposing fixed local windows. Some methods utilize the log ratio of read 
counts in the bin or window as a test statistic, thereby reducing the data 
to the familiar representation of array-based CNV/CNA detection, with 
Ivakhno et al. (2010) being an exception in that it uses the difference in 
tumor-normal window read counts in their HMM segmentation. There are 
a number of downsides to the binning or local window approach. First, due 
to the inhomogeneity of reads, certain bins will receive much larger number 
of reads overall than other bins, and the optimal window size varies across 
the genome. If the number of reads in a bin is not large enough, the normal 
approximations that are employed in many of these methods break down. 
Second, by binning or fixed-size window sliding, the estimated CNV/CNA 
boundaries can be imprecise if the actual breakpoints are not close to the 
bin or window boundary. This problem can be somewhat mitigated by re- 
fining the boundary after the change point is called, as done in SegSeq. In 
this paper, we propose a unified model, one that detects the change points, 
estimates their locations, and assesses their uncertainties simultaneously. 

To illustrate and evaluate our method, we apply it to real and spiked- 
in data based on a pair of NCI-60 tumor/normal cell lines, HCC1954 and 
BL1954. The data for these samples were produced and investigated by 
Chiang et al. (2009). The whole-genome shotgun sequencing was performed 
on the Illumina platform and the reads are 36 bp long. After read and 
mapping quality exclusions, 7.72 million and 6.65 million reads were used 
for the tumor (HCC1954) and normal (BL1954) samples, respectively. Newer 
sequencing platforms produce much more massive data sets. 

3. A change-point model on two nonhomogeneous Poisson processes. We 

start with a statistical model for the sequenced reads. Let {X t \t < T} and 
\Xt \t < T} be the number of reads whose first base maps to the left of base lo- 
cation t of a given chromosome for the case and control samples, respectively. 
We can view these count processes as realizations of two nonhomogeneous 
Poisson processes (NHPP), one each for the case and control samples, 

{X t }~NHPP(/i 4 ), 

(1) 

{y t }~NHPP(A t ). 
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The scale t is in base pairs. The scenario where two or more reads are 
mapped to the same genomic position is allowed by letting fit and Xt take 
values larger than 1 and assuming that the observed process is binned at the 
integers. We propose a change-point model on the conditional probability 
of an event at position t being from {^}, given that there is such an event 
from either {Xt} or {Yt}, namely, 

(2) p (t) = —^—= Pk if t k <t<t k+l ,k = I,..., K. 

An example of data according to this model is shown in Figure 2. The 
change-point model assumption can be equivalently expressed as 

tH = hf(t), 

where f(t) = p(t)/[l—p(t)] is piecewise constant with change points {t k }. Of 
course, we require the collection of change points to lie within the observation 
window: 

= t < h < ■ ■ - < t K+1 = T. 

This model does not force the overall intensity of case and control reads 
to be the same. The intensity function Xt reflects the inhomogeneity of the 
control reads. One interpretation of the model is that, apart from constant 
shifts, the fluctuation of coverage in the case sample is the same as that in 
the control sample. This is reasonable if the case and control samples are 
prepared and sequenced by the same laboratory protocol and mapped by 
the same procedure, as we discussed in Section 2. The model would not be 
valid if the intensity functions for samples have significant differences caused 
by nonmatching protocols or experimental biases. 

Let {U\, . . . , U mi }, {Vi, . . . , V m2 } be the event locations for processes {X t } 
and {Yt}, respectively. That is, U and V are the mapped positions of the 
reads from the case and control samples. Let m = m\ + mi be the total 
number of reads from the case and control samples combined. We combine 
the read positions from the case and control processes and keep them ordered 
in the genome position, obtaining combined read positions W\, . . . , W m and 
indicators of whether each event is a realization of the case process or the 
control process Z\ , . . . , Z m : 

() 7=f 1 > if W l e{U 1 ,...,U mi }, 

U 1 10, if Wi £ { Vi , • • • , V^ m2 }. 

For any read i in the combined process, we will sometimes use the term 
"success" to mean that Zi = l, that is, that the read is from the case pro- 
cess. Notice that the collection of change-point locations that can be inferred 
with the data is precisely {W\, . . . , W m }, since we do not have data points 
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to make inference in favor of or against any change points in between obser- 
vations. This means that estimating the copy number between two genome 
positions is equivalent to doing so for the closest pair of reads that span the 
two genome positions of interest. Namely, there is a one-to-one correspon- 
dence between the set of possible change points on {W\, . . . , W m } and the 
set of change points {1 = tq < t\ < ■ • • < tk+i = m} defined on the indices 
{1, . . . , m} through the following: 

{r k =j}^{t k = W j }. 

The above statement can be made formal through the equivariance prin- 
ciple. Consider the sample space [0,T] of any change point, any monotoni- 
cally increasing function (ft : [0,T] — > [0,T], and its natural vector extension 
0(ci,...,c n ) = ((ft(ci),...,(ft{c n )). 

Definition 1. A change-point estimator f is monotone transform equiv- 
ariant if for all monotonically increasing functions (ft: [0, T] —> [0, T] , we have 

T$(u),$(y)) = $(?(u,v)). 

The following theorem shows that any breakpoint estimator f(U, V) sat- 
isfying the equivariance condition can be decomposed into a simpler form. 

Theorem 1. Let f(U,V), which takes values in W, be an estimator of 
the breakpoints. Then f is monotone transform equivariant if and only if 
t(U,V) = Wft, where K = f(Z) taking integer values {1, ...,m} does not 
depend on W. 

Proof. For ease of notation, we let (ft(W) = ((ft(Wi), . . . , <ft(W m )) be 
the natural extension of (ft. Suppose f(U,V) = Wg, where K = f(Z). Note 
that Z is invariant to all monotone transformations of the arrival times, 
hence so is K. Therefore, 0(f (E7, V)) = <f>(Wg) = (</>(W))g = f((ft(U),(ft(V)). 

In the other direction, since f G W and (U, V) contain the same informa- 
tion as (W, Z), we must have f (U, V) = Wg-, wz y Suppose that K(W, Z) de- 
pend on W in a nontrivial way but f satisfies the monotone transform equiv- 
ariance condition. This means that there exist W ^ W such that K(W, Z) ^ 
K(W',Z). But since W and W' are both increasing finite sequences on 
[0,T] with the same number of elements, we must have some (ft{-) that 
j>(W) = W'. Note that (W',Z) induces (U',V) = (<j>(U) , <j>(V)) . However, 

= m'X) = w k{w ^ z) = m^w^) + = 

(ft(f(U,V)). Hence, the equivariance property holds if and only if K is only 
a function of Z. □ 

Theorem 1 implies that any breakpoint estimation procedure, that is, 
monotone transform equivariant uses the estimator K of integer breakpoints 
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based on Z, and that the actual read position W merely serves as a genomic 
scale lookup table. Hence, we can define our change-point model on the 
indices {l,...,m} for the read counts, and use the conditional likelihood 
which depends only on {Z{} but not on the event positions {W\, . . . , W m }: 

(4) p(j) = Pk if T k <j< T k+l . 

For the rest of this section, we will exclusively work with equation (4). 
The mapping positions {W±, . . . , W m } will re-enter our analysis when we 
compute confidence intervals for the copy number estimates, in Section 4. 
Our statistical problem is hence two-fold. First, given K, we need to esti- 
mate the change points {r k }. Second, we need a method to select model 
complexity, as dictated by K. 

We start by considering the following simplified problem: Given a single 
interval spanning reads i to j in the combined process, we want to test 
whether the success probability inside this interval, pij , is different from the 
overall success probability, p. The null model Hq states that p^ = p. We 
derive two statistics to test this hypothesis. The first is adopted from the 
conditional score statistic for a general exponential family model where the 
signal is represented by a kernel function, as discussed in Rabinowitz (1994), 

m / 1 m \ 

(5) Sij = ^2(Z k -p) lli<k<j Z^^k^j) = Z k -p(j + 

k=l V m k=l J i<k<j 

where p = ^2Z k /m. This statistic is simply the difference between the num- 
ber of observed and expected case events under the null model. Its variance 
at the null is 

a% = Var(5 y ) = (l - ^=^) U ~ » + l)p(l " p) 

and is used to standardize Sij for comparison between regions of different 
sizes. The standardized score statistic Tij = Sij/oij is intuitive and simple, 
and would be approximately standard normal if j — i were large. However, 
the normal approximation is not accurate if the number of reads that map 
to the region is low. To attain higher accuracy for regions with low read 
count, observe that X^i<fc<j is a binomial random variable, and use an 
exact binomial generalized likelihood ratio (GLR) statistic, 

Aij = sup h(po,Pij) - supZo(p), 
po,Pij v 

where the null model with one overall success probability parameter p is 
compared with the alternative model with one parameter p^j for inside the 
interval and another parameter po for outside the interval. From the 
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binomial log-likelihood function one obtains 

A,,= WW|) +( l- Zi )l„ g (I=S)} 

ke[i,j] 

where p, po, pij are maximum likelihood estimates of success probabilities 

m 

fc=l 
ke[i,j] 

Po= ^fc/( m -J+*- 1 )- 

The GLR and score statistics allow us to measure how distinct a specific 
interval is compared to the entire chromosome. For the more general 
problem in which is not given but only one such pair exists, we com- 
pute the statistic for all unique pairs of to find the most significantly 
distinct interval. This operation is 0(m 2 ) and to improve efficiency, we have 
implemented a search-refinement scheme called Iterative Grid Scan in our 
software. It works by identifying larger interesting intervals on a coarse grid 
and then iteratively improving the interval boundary estimates. The compu- 
tational complexity is roughly 0(m log m) and hence scales easily. A similar 
idea was studied in Walther (2010). 

In the general model with multiple unknown change points, one could the- 
oretically estimate all change points simultaneously by searching through all 
possible combinations of {t^.}. But this is a combinatorial problem where 
even the best dynamic programming solution [Bellman (1961); Bai and Per- 
ron (2003); Lavielle (2005)] would not scale well for a data set containing 
millions of reads. Thus, we adapted Circular Binary Segmentation [Olshen 
et al. (2004); Venkatraman and Olshen (2007)] to our change-point model 
as a greedy alternative. In short, we find the most significant region (i,j) 
over the entire chromosome, which divides the chromosome in to 3 regions 
(or two, if one of the change points lies on the edge). Then we further scan 
each of the regions, yielding a candidate subinterval in each region. At each 
step, we add the most significant change point(s) over all of the regions to 
the collection of change-point calls. 

Model complexity grows as we introduce more change points. This brings 
us to the issue of model selection: We need a method to choose an appropri- 
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ate number of change points K. Zhang and Siegmund (2007) proposed a so- 
lution to this problem for Gaussian change-point models with shifts in mean. 
Like the Gaussian model, the Poisson change-point model has irregularities 
that make classic measures such as the AIC and the BIC inappropriate. An 
extension of Zhang and Siegmund (2007) gives a Modified Bayes Information 
Criterion (mBIC) for our model, derived as a large sample approximation 
to the Bayes Factor in the spirit of Schwarz (1978): 



where m! is the number of unique values in {W\, . . . , Wm}. 

The first term of mBIC is the generalized log-likelihood ratio for the 
model with K change points versus the null model with no change points. 
In our context, K ideally reflects the number of biological breakpoints that 
yield the copy number variants. The remaining terms can be interpreted 
as a "penalty" for model complexity. These penalty terms differ from the 
penalty term in the classic BIC of Schwarz (1978) due to nondifferentiability 
of the likelihood function in the change-point parameters {rfc}, and also 
due to the fact that the range of values for {r^} grow with the number of 
observations m. For more details on the interpretation of the terms in the 
mBIC, see Zhang and Siegmund (2007). Finally, we report the segmentation 
with K = argmax^ mBIC(A') change points. 

4. Approximate Bayesian confidence intervals. As noted in the Introduc- 
tion, it is particularly important for sequencing data to assess the uncer- 
tainty in the relative read intensity function at each genomic position. We 
approach this problem by constructing approximate Bayesian confidence in- 
tervals. 

Suppose Zi,. . ., Z m are independent realizations of Bernoulli random vari- 
ables with success probabilities {pt}- Consider first the one change-point 
model (which can be seen as a local part of a multiple change-point model), 
where 



Without loss of generality, we may take r € {1, 2, . . . , m}. Assume a uniform 
prior for r on this discrete set. Let St be the number of successes up to and 
including the tth realization, 




+ - log(m) — K log(m') 




if t < t, 
if t > t. 



Ki<t 
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Our goal is to construct confidence bands for pt at each t G {1, 2, . . . , m}. 
Assume a Beta(a,/3) prior for po and p\. If we knew r, then the posterior 
distribution of po and p\ is 

/(Po|^, r = T*) oc f(p )f(S T * bo) 

~Beta(a,/3) • Binom(5 r *; T*,po) 

~ Beta(a + S T * ,(3 + t* — S t * ) , 
f(p 1 \Z,r = T*) oc f(pi)f(S m - S T * \pi) 

~ Beta(a + S m — S T * ,/3 + m — T* — S m + S T * ) . 

Now, without knowing the actual r*, we compute the posterior distribution 
of pt as 



f(p t \Z) = J2f(Pt,r = i\Z) 



m 



J2f(Pt\r = i,Z)-f(r = i\Z). 



■i=i 



As before, the first part of the summation term is a beta distribution, 

f(p W — i Z) — { Beta ( a + + i ~ if t < i, 

\Beta(a + S m — Si, (3 + m — i — S m + Si), if t > i, 

and for the second term, we define the likelihood of the change point at i as 
Li = f(Z\r = i) and observe that with the uniform prior on r, 

f(r = i\Z) oc Li/m oc L i} 

where 

(6) Li = JH p^il-po^dPo- J [J pf'il-Pif'^dPu 

l<j'<t i<j<m 

and d-Po an d dP\ are with respect to the prior distributions of po and p\. 
With Beta(a,/3) priors on po and pi, we can find the closed form expression 
of Lf. 

Li = J II Po'^-Po^dPo- J [J pf'il-pif-Z'dPi 

ISrjS:* i<j<m 

B ^Jp s o^-por Si prHi-Po)^dp 



B{a,p) J 
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(7) 

_ B(a + Si,fi + i — Si)B(a + S m - Si, (3 + m - i - S m + Si) 

_ Tja + S^TiP + i-Si) 
T(a + /3 + i) 

k r(a + S m - Sj)T((3 + m-i-S m + Si) T(a + /3) 2 
X T(a + (3 + m-i) T(a) 2 T(f3) 2 ' 

Hence, we can compute, without knowing the actual value of r, 

m j. 

(8) f(pt\Z) oc y^J{pt\r = i,Z) • - 1 where f = arg max Lj. 

1=1 

Observe that the posterior distribution is a mixture of Beta(-, •) distri- 
butions. In theory, we could compute weights wi = Li/L^ for all positions i 
and then numerically compute (§,1 — %■) quantiles of the posterior beta 
mixture distribution to obtain the Bayesian confidence intervals. However, 
in practice, one can approximate the sum in (8) by 

1 



f(Pt\Z) 



E Wi> e W ' 



^ W if(Pt\ T = hZ) 



Wi>e 



for some small e > 0, hence ignoring the highly unlikely locations for the 
change points. Empirically, we use e = 10 -4 . It is easy to see that the se- 
quence of log likelihood ratios for alternative change points, log-^-, form 
random walks with negative drift as i moves away from the true change 
point r [Hinkley (1970)]. The negative drift depends on the true po, p± and 
is larger in absolute magnitude when the difference between po an d pi is 
larger. With r unknown, since P(\f — t\ <S) can be made arbitrarily close 
to 1 for 5 = o(m), one can make the same random walk construction for 
\og(Li/L+) bounded away by 5 from f , as done in Cobb (1978). This im- 
plies that, for any e > 0, one may find a constant c £iPOiPl such that for any i 
at least c £iPOiPl steps away from f,wi<e with probability approaching 1. 
Hence, it is reasonable to use a small cutoff to produce a close approximation 
to the posterior distribution. 

The extension of this construction to multiple change points is straight- 
forward. It entails augmenting the mixture components of one change point 
with that of its neighboring change points. This gives a computationally 
efficient way of approximating the Bayesian confidence interval using, typ- 
ically, a few hundred mixture components, which has been implemented in 
seqCBS. There is also an extensive body of literature on constructing confi- 
dence intervals and confidence sets for estimators of the change point r. We 
refer interested readers to Siegmund (1988b) for discussion and efficiency 
comparison of various confidence sets in change-point problems. 
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(a) (b) 

Fig. 3. Comparison of seqCBS and array-based CN profiling, (a) HCC/BL1954 Chr 8, 
array, seqCBS, baseline read intensity, (b) HCC/BL1954 Chr 11, array, seqCBS, baseline 
read intensity. 

5. Results. We first applied the proposed method to a matched pair of 
tumor and normal NCI-60 cell lines, HCC1954 and BL1954. Chiang et al. 
(2009) conducted the sequencing of these samples using the Illumina plat- 
form. For comparison with array-based copy number profiles on the same 
samples, we obtained array data on HCC1954 and BL1954 from the NCI- 
60 database at http://www.sanger.ac.uk/genetics/CGP/NCI60/. We ap- 
plied the CBS algorithm [Olshen et al. (2004), Venkatraman and Olshen 
(2007)] with modified BIC stopping algorithm [Zhang and Siegmund (2007)] 
to estimate relative copy numbers based on the array data. 

Figure 3 shows the copy number profiles estimated from the array data 
(top) and from the sequencing data by seqCBS (middle), for two represen- 
tative chromosomes where there appears to be a number of copy number 
alternation events. The bottom plots show the baseline X(t) function esti- 
mated by smoothing the binned counts of the normal sample sequencing 
data. There is clearly inhomogeneity in the rate function. The points for the 
top plots are normalized log ratios for the intensities of each probe on the 
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array, whereas those for the middle plots are log ratios of binned counts for 
the tumor and normal samples. Note that the binned counts for the sequenc- 
ing data are only for illustrative purposes, as the proposed method operates 
on the point processes directly, which are difficult to visualize at the whole- 
chromosome scale. The piecewise constant lines indicate the estimated log 
relative copy numbers and the change-point locations. Note that after ad- 
justing for overall number of read differences between the two samples, the 
relative copy number is estimated by p(t)/(l — p(t)), where pit) is the MLE 
estimate of the success probability of the segment into which t falls. 

The shape of the profile and overall locations for most change-point calls 
are common between the array and sequencing data. That is, CBS and Se- 
qCBS applied on data generated from two distinct platforms generally agree. 
It is interesting that in the regions where a large number of CNA events seem 
to have occurred, our proposed method with sequencing is able to identify 
shorter and more pronounced CNA events. It also appears that the CNA 
calls based on sequencing are smoother in the sense that small magnitude 
shifts in array-based results, such as the change points after 102 Mb of Chro- 
mosome 11, are ignored by seqCBS. Similar observations can be made with 
results on other chromosomes as well. Since we do not know the ground 
truth in these tumor samples, it is hard to assess in more detail the per- 
formance of the estimates. Detailed spike-in simulation studies in the next 
section give a systematic view of the accuracy of the proposed method, as 
compared to the current standard approach. 

Figure 4 is an example illustrating the approximate Bayesian confidence 
interval estimates around two CNA events on Chromosome 8 of HCC1954. 
This is not a typical example among the change-point calls. Typically, the 
signal differences are stronger and, hence, the confidence bands are narrower 
with little ambiguity region. The actual locations of mapped reads are shown 
as tick marks, with ticks at the bottom of the plot representing control reads, 
and ticks at the top representing case reads. The estimated relative copy 
numbers and their point-wise approximate Bayesian confidence intervals are 
shown as black and grey lines, respectively. One can see that the width 
of the confidence intervals depends not only on the number of reads in 
the segment, but also on the distance from the position of interest to the 
called change points, and that the confidence intervals are not necessarily 
symmetric around the estimated copy number. 

6. Performance assessment. To assess the performance of the proposed 
method more precisely, we conducted a spike-in simulation study. We empir- 
ically estimated the underlying inhomogeneous rate function A(i) by kernel 
smoothing of the read counts from the normal sample, BL1954, in Chiang 
et al. (2009). The simulated tumor rate function is then constructed 
by spiking into A(t) segments of single copy gain/loss. Since the length of 
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Reads and Inference around 113088977 
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113138977 



Fig. 4. Bayesian CI, HCC1954 Chr 8. The solid black line is the estimated copy number, 
light gray lines are 95% confidence intervals. The actual locations of the case and control 
reads are shown as tick marks at the top and bottom of the plot, respectively. 



the CNA events influences their ease of detection, we considered a range 
of different signal lengths. Each simulated case sample contains 50 changed 
segments. We compared seqCBS to SegSeq, which is one of the more popular 
available algorithms. For seqCBS, we used mBIC to determine appropriate 
model complexity. We used SegSeq with its default parameters. A change- 
point call was deemed true if it was within 100 reads of a true spike-in change 
point, after using a matching algorithm implemented in the R package clue 
by Hornik (2005, 2010) to hnd minimal-distance pairing between the called 
change points and the true change points. Performance was evaluated by 
recall and precision, defined respectively as the proportion of true signals 
called by the method and the proportion of signals called that are true. 
The simulation was repeated multiple times to reduce the variance in the 
performance measures. 

Figure 5 summarizes the performance comparison at default settings for 
a number of spike-in signal lengths. The horizontal lines are mean recall and 
precision rates for the methods. We see that SeqCBS, used with either the 
score test statistic or the GLR statistic, offers significant improvement over 
the existing method in both precision and recall. The performances of the 
score and GLR statistics are very similar, as their recall and precision curves 
almost overlap. The improvement in precision can be largely attributed to 
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Fig. 5. Recall and precision of seqCBS & SegSeq. 

the fact that mBIC provides a good estimate of model complexity, as can 
be seen in Figure 6(a). 

We studied the performance sensitivity on tuning parameters. SegSeq al- 
lows three tuning parameters: local window size (W), number of false positive 
candidates for initialization (A), and number of false positive segments for 
termination (B). The proposed method has a step size parameter (G) that 
controls the trade-off between speed and accuracy in our Iterative Grid Scan 
component, and hence influences performance. We varied these parameters 
and recorded the performance measures in Table 1. It appears that local win- 
dow size (W) is an important tuning parameter for SegSeq, and in scenarios 
with relatively short signal length, a smaller W = 250 provides significant 
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Fig. 6. Model complexity and timing by seqCBS & SegSeq. (a) Model complexity com- 
parison, (b) Timing comparison. 
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Table 1 

Performance measures and tuning parameters. SegSeq: W= fixed local window size 
(default 500), A — number of false positive candidates for initialization (default 1000), 
B— number of false positive segments for termination (default 10). SeqCBS: Scr= score 
statistic, Bin — GLR statistic, G = IGS power step size (default 10); N/A values indicate 
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improvement in its performance. This echoes with our previous discussion 
that methods using a single fixed window size would perform less well when 
the signals are not of the corresponding length. Some of the parameter com- 
binations for SegSeq result in program running errors in some scenarios, and 
are marked as NA. The step size parameter (G) in SeqCBS, in constrast, 
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controls the rate at which coarse segment candidates are refined and the rate 
at which the program descends into searching smaller local change points, 
rather than defining a fixed window size. A smaller step size typically yields 
slightly better performance. However, the proposed method is not nearly as 
sensitive to its tuning parameters. We also conducted a timing experiment 
to provide the reader with a sense of the required computational resources to 
derive the solution. Our proposed method compares favorably with SegSeq as 
seen in Figure 6(b). The GLR statistic is slightly more complex to compute 
than the score statistic, as is reflected in the timing experiment. However, 
copy number profiling is inherently a highly parallelizable computing prob- 
lem: one may distribute the task for each chromosome among a multi-CPU 
computing grid, hence dramatically reducing the amount of time required 
for this analysis. 

7. Discussion. We proposed an approach based on nonhomogeneous Pois- 
son Processes to directly model next-generation DNA sequencing data, and 
formulated a change-point model to conduct copy number profiling. The 
model yields simple score and generalized likelihood ratio statistics, as well 
as a modified Bayes information criterion for model selection. The proposed 
method has been applied to real sequencing data and its performance com- 
pares favorably to an existing method in a spike-in simulation study. 

Statistical inference, in the form of confidence estimates, is very impor- 
tant for sequencing-based data, since, unlike arrays, the effective sample size 
(i.e., coverage) for estimating copy number varies substantially across the 
genome. In this paper, we derived a procedure to compute Bayesian confi- 
dence intervals on the estimated copy number. Other types of inference, such 
as p- values or confidence intervals on the estimated change points, may also 
be useful. Siegmund (1988b) compares different types of confidence intervals 
on the change points, and the methods there can be directly applied to this 
problem. The reader is referred to Rabinowitz (1994) and Siegmund (1988a) 
for existing methods on significance evaluation. 

Some sequencing experiments produce paired end reads, where two short 
reads are performed on the two ends of a longer fragment of DNA. The 
pairing information can be quite useful in the profiling of structural genomic 
changes. It will be important to extend the approach in this paper to handle 
this more complex data type. 

A limitation of the proposed method and the existing methods is that 
they do not handle allele-specific copy number variants. It is possible to 
extend our model to accommodate this need. With deep sequencing, one 
may assess whether each loci in a CNV is heterozygous, and estimate the 
degree to which each allele contributes to the gain or loss of copy number, 
by considering the number of reads covering the locus with the major allele 
versus those with the minor allele. This is particularly helpful for detecting 
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deletion. Furthermore, in the context of assessing the allele-specific copy 
number, existing SNP arrays have the advantage that the assay targets 
specific sites for that problem, whereas to obtain sufficient evidence of allele- 
specific copy number variants with sequencing, a much greater coverage 
would be required since the overwhelming majority of reads would land in 
nonallelic genomic regions. Spatial models that borrow information across 
adjacent variant sites, such as Chen, Xing and Zhang (2011) and Olshen 
et al. (2011), would be helpful for improving power. 

Recently, there has been increased attention to the problem of simulta- 
neous segmentation of multiple samples [Lipson et al. (2006); Shah et al. 
(2007); Zhang et al. (2010); Siegmund, Yakir and Zhang (2011)]. One may 
also wish to extend this method to the multi-sample setting, where in ad- 
dition to modeling challenges, one also needs to address more sources of 
systematic biases, such as batch effects and carry-over problems. 

Computational challenges remain in this field. With sequencing capacity 
growing at record speed, even basic operations on the data set are resource- 
consuming. It is pertinent to develop faster and more parallelizable solutions 
to the copy number profiling problem. 
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